feat: store pins in datastore instead of a DAG #2771
Ad-hoc testing script: add a buffer without pinning it, then time how long it takes to pin it. Store the times and print the averages every 1,000 pins:

```javascript
'use strict'

const last = require('it-last')
const drain = require('it-drain')
const { createController } = require('ipfsd-ctl')

async function main () {
  const ipfs = await createController({
    type: 'go',
    ipfsBin: require('go-ipfs-dep').path(),
    ipfsHttpModule: require('ipfs-http-client'),
    disposable: false
  })

  await ipfs.init()
  await ipfs.start()

  let times = []
  let chunk = 0

  for (let i = 0; i < 83000; i++) {
    // add a block without pinning it
    const buf = Buffer.from(`${Math.random()}`)
    const result = await last(ipfs.api.add(buf, {
      pin: false
    }))

    // time how long the pin takes
    const start = Date.now()
    const res = await ipfs.api.pin.add(result.cid)

    if (res[Symbol.asyncIterator]) {
      await drain(res)
    }

    const mem = process.memoryUsage()
    times.push({
      ...mem,
      elapsed: Date.now() - start
    })

    chunk++

    if (chunk === 1000) {
      const sum = times.reduce((acc, curr) => {
        acc.elapsed += curr.elapsed
        acc.rss += curr.rss
        acc.heapTotal += curr.heapTotal
        acc.heapUsed += curr.heapUsed
        acc.external += curr.external
        return acc
      }, { elapsed: 0, rss: 0, heapTotal: 0, heapUsed: 0, external: 0 })

      console.info(`${i + 1}, ${sum.elapsed / times.length}, ${sum.rss / times.length}, ${sum.heapTotal / times.length}, ${sum.heapUsed / times.length}, ${sum.external / times.length}`)

      chunk = 0
      times = []
    }
  }

  await ipfs.stop()
}

main()
```

Results: 10k pins, DAG vs datastore, ranging from a 20-300x speedup in the time taken to add a single pin. After 100k pins there doesn't seem to be much performance degradation when storing in the datastore, whereas the DAG method degrades significantly after 8192 pins (see #2197 for discussion of that).

The next significant performance jump vs DAGs would probably come after the first layer of buckets is full - e.g. 256 buckets of 8192 pins = 2,097,152 pins. That'll probably take a bit of time to benchmark...
Next steps:
That's a very cool speed improvement! Some observations:
I guess you could only store the cid version/codec in the pin? I was thinking of changing the pin type to be an integer too, so there are definitely some improvements that can be made, this is just a first pass.
@Stebalien has talked about making a similar change to this too so it's only slightly ahead of the go-ipfs repo. At any rate, go-ipfs is switching to badger by default which js-ipfs can't read so I'm not sure how much of a priority that is any more.
I guess you can't share your entire list of pins by sharing one CID, but then you also no longer have to share your entire list of pins - you can share individual ones. Grouping multiple pins as pinsets could be added back in as a new feature, and the human-readable names would make this nicer to work with. Something like:

```console
$ ipfs pin add Qmfoo
pinned Qmfoo recursively
$ ipfs pin-set add my-super-fun-pinset Qmfoo
$ ipfs pin-set list my-super-fun-pinset
my-super-fun-pinset Qmqux
Qmfoo
Qmbar
Qmbaz
```

You could even have the root of a pinset be an IPNS name to allow pulling updates from the network. That'd be neat.
This can be fixed :).
At the moment, I think this is causing strictly more harm than help. It's been 6 years and I have yet to see someone use this. Ideally, everything would be stored in an IPLD-backed graph database. However, we aren't there yet in terms of tooling. We could get part way there by creating an IPLD-backed datastore (datastore -> IPLD HAMT -> datastore) but that will throw away the type information.
Any reason to store the CID?
Base64url? Go-ipfs, at least, now has hyper-optimized base58. However, it's still slower than base64 (and takes more space). Questions/comments.
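To make the size part of the base64-vs-base58 question concrete, here is a rough comparison for a 32-byte sha2-256 digest. The `base58btc` function is hand-rolled purely for illustration (real code would use a multibase library), and `digest` is a stand-in value, not a real hash:

```javascript
// Hand-rolled base58btc encoder - illustration only, real code would use a
// multibase library. BigInt division makes this much slower than base64.
function base58btc (buf) {
  const alphabet = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'
  let n = BigInt('0x' + buf.toString('hex'))
  let out = ''
  while (n > 0n) {
    out = alphabet[Number(n % 58n)] + out
    n /= 58n
  }
  // preserve leading zero bytes as '1's
  for (const byte of buf) {
    if (byte !== 0) break
    out = '1' + out
  }
  return out
}

const digest = Buffer.alloc(32, 0xab) // stand-in for a sha2-256 digest

console.log(digest.toString('base64url').length) // 43
console.log(base58btc(digest).length)            // 44
```

So the space difference for key material of this size is small (one character here), while the encode cost difference is the bigger factor: base64 is a fixed bit-shuffle, base58 requires arbitrary-precision division.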
Some more graphs. I pinned 83k single blocks using the test script above (originally intended to be 100k but the js-dag benchmark took too long to run and I had to get on an aeroplane). The initial hump at 8192 pins is there, then a consistent performance degradation over time. At 83k pins, js is taking 2.5s to add a pin. Go has the same degradation but it is significantly less pronounced. The js-dag implementation stores the pinsets in memory, js-datastore does not. There is an increase in memory usage over time, but it may not be hitting the v8 GC threshold, or there's a leak somewhere...
The sizes appear to be comparable, or perhaps the differences are statistically insignificant compared to the block size. After completing the benchmark and running `repo gc` I see:

```console
# js-dag
.jsipfs $ du -hs
367M    .

# go-dag
.ipfs $ du -hs
353M    .

# js-datastore
.jsipfs $ du -hs
344M    .
```
Yes, this is the idea behind storing them CBOR encoded rather than protobufs.
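As a rough illustration of what "stored as CBOR" means at the byte level for a flat pin record, here is a minimal hand-rolled encoder. This is a sketch only - the actual implementation uses a real CBOR library, and `cborEncode` is a hypothetical name:

```javascript
// Minimal CBOR encoder for flat maps of strings, small ints and Buffers.
// Illustration only - handles just the cases a simple pin record needs.
function cborEncode (obj) {
  const chunks = []
  // CBOR item header: 3-bit major type + 5-bit immediate length/value
  const hdr = (major, len) => {
    if (len < 24) return Buffer.from([(major << 5) | len])
    if (len < 256) return Buffer.from([(major << 5) | 24, len])
    throw new Error('sketch only supports lengths < 256')
  }

  const keys = Object.keys(obj)
  chunks.push(hdr(5, keys.length)) // major type 5: map

  for (const key of keys) {
    chunks.push(hdr(3, Buffer.byteLength(key)), Buffer.from(key)) // text string key

    const value = obj[key]
    if (typeof value === 'number') {
      chunks.push(hdr(0, value)) // major type 0: unsigned int
    } else if (typeof value === 'string') {
      chunks.push(hdr(3, Buffer.byteLength(value)), Buffer.from(value))
    } else if (Buffer.isBuffer(value)) {
      chunks.push(hdr(2, value.length), value) // major type 2: byte string
    }
  }

  return Buffer.concat(chunks)
}

console.log(cborEncode({ depth: 0 }).toString('hex')) // a165646570746800
```

Unlike protobufs there is no fixed schema, so fields like `metadata` can be added to the record without a schema migration - the encoding is self-describing.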
Good suggestion, names are not unique so comments might be a better field name.
If we're not going to let the user query by name, we probably shouldn't do this.
My thinking was that by using the multihash of a block as the pin identifier (not the full CID), it becomes cheap to calculate if a given block has already been pinned (assuming the user has hashed it with the same algorithm). The full CID is stored so we can show the user what they used to pin the block when they do a …
cc @hsanjuan
Adds a `.pins` datastore to `ipfs-repo` and uses that to store pins as cbor
binary keyed by base64 stringified multihashes (n.b. not CIDs).
Each pin has several fields:
```javascript
{
cid: // buffer, the full CID pinned
type: // string, 'recursive' or 'direct'
comments: // string, human-readable comments for the pin
}
```
BREAKING CHANGES:
* pins are now stored in a datastore, a repo migration will be necessary
* ipfs.pins.add now returns an async generator
* ipfs.pins.rm now returns an async generator
Depends on:
- [ ] ipfs/js-ipfs-repo#221
The changes in ipfs/js-ipfs#2771 mean that the input/output of `ipfs.pins.add` and `ipfs.pins.rm` are now streaming so this PR updates to the new API.
Adds a `.pins` datastore to `ipfs-repo` and uses that to store pins as cbor binary keyed by multihash.
### Format
As stored in the datastore, each pin has several fields:
```javascript
{
codec: // optional Number, the codec from the CID that this multihash was pinned with, if omitted, treated as 'dag-pb'
version: // optional Number, the version number from the CID that this multihash was pinned with, if omitted, treated as v0
  depth: // Number, Infinity = recursive pin, 0 = direct, 1+ = pinned to a depth
comments: // optional String user-friendly description of the pin
metadata: // optional Object, user-defined data for the pin
}
```
Notes:
`.codec` and `.version` are stored so we can recreate the original CID when listing pins.
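Since the datastore key is only the multihash, but a CID is (version, codec, multihash), those two fields are what's needed to rebuild the original CID. A sketch of the CIDv1 byte layout - `rebuildCidV1` is a hypothetical helper, and varints are elided since these multicodec codes fit in one byte:

```javascript
// Rebuild CIDv1 bytes from a stored pin record plus its datastore key.
// Illustration only - real code would use a CID library.
function rebuildCidV1 (pin, multihash) {
  // excerpt of the multicodec table
  const codecs = { 'dag-pb': 0x70, 'dag-cbor': 0x71, raw: 0x55 }
  return Buffer.concat([
    Buffer.from([0x01]),                          // CID version 1
    Buffer.from([codecs[pin.codec || 'dag-pb']]), // codec from the pin record
    multihash                                     // the datastore key itself
  ])
}

// fake sha2-256 multihash: 0x12 0x20 + 32 digest bytes
const mh = Buffer.concat([Buffer.from([0x12, 0x20]), Buffer.alloc(32, 0xaa)])
console.log(rebuildCidV1({ codec: 'raw' }, mh).length) // 36
```

Without the stored `codec`/`version`, `ipfs pin ls` could only show users a normalized CID, not the one they originally pinned with.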
### Metadata
The intention is for us to be able to add extra fields that have technical meaning to the root of the object, and the user can store application-specific data in the `metadata` field.
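The root-fields-vs-metadata split described here can be sketched as a record constructor. Field names follow the format above, but `makePinRecord` itself is a hypothetical helper, not the PR's actual code:

```javascript
// Technical fields live at the record root; arbitrary user data is confined
// to `metadata`, so new technical root fields can be added later without
// colliding with application keys.
function makePinRecord (cid, { recursive = true, metadata } = {}) {
  return {
    depth: recursive ? Infinity : 0, // Infinity = recursive, 0 = direct
    version: cid.version,            // stored so the original CID can be rebuilt
    codec: cid.codec,
    metadata: metadata ? { ...metadata } : undefined // user-owned namespace
  }
}

// a fake CID-like object, for illustration
const record = makePinRecord({ version: 1, codec: 'dag-cbor' }, {
  metadata: { key1: 'value1' }
})
```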
### CLI
```console
$ ipfs pin add bafyfoo --metadata key1=value1,key2=value2
$ ipfs pin add bafyfoo --metadata-format=json --metadata '{"key1":"value1","key2":"value2"}'
$ ipfs pin list
bafyfoo
$ ipfs pin list -l
CID Name Type Metadata
bafyfoo My pin Recursive {"key1":"value1","key2":"value2"}
$ ipfs pin metadata Qmfoo --format=json
{"key1":"value1","key2":"value2"}
```
### HTTP API
* '/api/v0/pin/add' route adds new `metadata` argument, accepts a json string
* '/api/v0/pin/metadata' returns metadata as json
### Core API
* `ipfs.pin.addAll` accepts and returns an async iterator
* `ipfs.pin.rmAll` accepts and returns an async iterator
```javascript
// pass a cid or IPFS Path with options
const { cid } = await ipfs.pin.add(new CID('/ipfs/Qmfoo'), {
  recursive: false,
  metadata: {
    key: 'value'
  },
  timeout: 2000
})
// pass an iterable of CIDs
const [{ cid: cid1 }, { cid: cid2 }] = await all(ipfs.pin.addAll([
new CID('/ipfs/Qmfoo'),
new CID('/ipfs/Qmbar')
], { timeout: '2s' }))
// pass an iterable of objects with options
const [{ cid: cid1 }, { cid: cid2 }] = await all(ipfs.pin.addAll([
{ cid: new CID('/ipfs/Qmfoo'), recursive: true, comments: 'A recursive pin' },
{ cid: new CID('/ipfs/Qmbar'), recursive: false, comments: 'A direct pin' }
], { timeout: '2s' }))
```
* `ipfs.pin.rmAll` accepts and returns an async generator (other input types are available)
```javascript
// pass an IPFS Path or CID
const { cid } = await ipfs.pin.rm(new CID('/ipfs/Qmfoo/file.txt'))

// pass options
const { cid } = await ipfs.pin.rm(new CID('/ipfs/Qmfoo'), { recursive: true })

// pass an iterable of CIDs or objects with options
const [{ cid }] = await all(ipfs.pin.rmAll([{ cid: new CID('/ipfs/Qmfoo'), recursive: true }]))
```
Bonus: lets us pipe the output of one command into another:
```javascript
await pipe(
ipfs.pin.ls({ type: 'recursive' }),
(source) => ipfs.pin.rmAll(source)
)
// or
await all(ipfs.pin.rmAll(ipfs.pin.ls({ type: 'recursive' })))
```
BREAKING CHANGES:
* pins are now stored in a datastore, a repo migration will occur on startup
* All deps of this module now use Uint8Arrays in place of node Buffers
Adds a `.pins` datastore to `ipfs-repo` and uses that to store pins as cbor binary keyed by base32 encoded multihashes (n.b. not CIDs).

Future tech:

* Pin namespaces, e.g. storing pins under `/default/C19A797...` or `/my-namespace/C19A797...` and listing them with `ipfs pin ls --namespace=my-namespace`
* Querying pins by metadata, e.g. `ipfs pin query metadata.key1=value1`